Conversation

@jasha26 (Contributor) commented Dec 15, 2025

This PR introduces a comprehensive microbenchmark suite for gcsfs to measure the performance of I/O operations under various conditions. The framework is built using pytest and the pytest-benchmark plugin, providing detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.

Key Changes:

  • Microbenchmark Framework: A new microbenchmark suite has been added under gcsfs/tests/perf/microbenchmarks/. It includes:

    • A detailed README.md with instructions on how to set up and run the benchmarks.
    • An orchestrator script run.py that simplifies the process of running benchmarks and generating reports.
    • Pytest fixtures in conftest.py for setting up and tearing down the test environment, including creating temporary files of specified sizes.
  • Read Benchmarks: The initial implementation focuses on read performance with the following features (see the sketch after this list):

    • Patterns: Supports both sequential (seq) and random (rand) read patterns.
    • Parallelism: The benchmarks can be run in single-threaded, multi-threaded, and multi-process modes. The number of threads and processes is configurable.
    • Configuration: Key parameters such as file size, block size, and chunk size can be easily configured through environment variables.
  • Configuration and Settings:

    • Benchmark parameters are defined in gcsfs/tests/perf/microbenchmarks/read/parameters.py and can be customized using environment variables as defined in gcsfs/tests/settings.py.
    • The benchmarks are skipped by default and can be enabled by setting the GCSFS_BENCHMARK_SKIP_TESTS environment variable to false.
  • Reporting:

    • The run.py script generates a timestamped directory containing detailed results in both JSON and CSV formats.
    • A summary table is printed to the console for a quick overview of the results, including min/mean latency and max throughput.
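
To make the shape of these tests concrete, here is a minimal sketch of a sequential-read benchmark in this style. It is illustrative only: the `gcs_path` fixture, the environment-variable names, and the defaults are assumptions, not the exact code in test_read.py.

```python
# Illustrative sketch of a sequential-read microbenchmark; the real
# implementation lives in gcsfs/tests/perf/microbenchmarks/read/test_read.py.
import os

import pytest

import gcsfs

# Assumed env-var names and defaults, in the spirit of gcsfs/tests/settings.py.
FILE_SIZE = int(os.environ.get("GCSFS_BENCH_FILE_SIZE", 64 * 2**20))
BLOCK_SIZE = int(os.environ.get("GCSFS_BENCH_BLOCK_SIZE", 4 * 2**20))


@pytest.mark.benchmark(group="read-seq")
def test_read_sequential(benchmark, gcs_path):
    # `gcs_path` is a hypothetical fixture that uploads a FILE_SIZE temp file.
    fs = gcsfs.GCSFileSystem()

    def read_all():
        # Sequential pattern: walk the file front to back in BLOCK_SIZE reads.
        with fs.open(gcs_path, "rb", block_size=BLOCK_SIZE) as f:
            while f.read(BLOCK_SIZE):
                pass

    benchmark(read_all)  # pytest-benchmark times repeated calls of read_all
```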

For Reviewers:

  • Please review the overall design of the microbenchmark framework.
  • The core logic for the read benchmarks is in gcsfs/tests/perf/microbenchmarks/read/test_read.py.
  • The configuration and parameterization can be reviewed in gcsfs/tests/perf/microbenchmarks/read/configs.py and gcsfs/tests/settings.py.
  • The README.md file in the microbenchmarks directory should be reviewed for clarity and completeness.
  • Note that these benchmarks are designed to run against live GCS buckets and are skipped in the standard CI pipeline.

This new benchmark suite will be a valuable tool for more data-driven performance analysis and for ensuring that gcsfs remains highly performant.

@ankitaluthra1 (Collaborator):

/gcbrun

@martindurant (Member) left a comment:

I have very little to say on this - I'm excited to see the results and what improvements your efforts can make!


## Introduction

This document describes the microbenchmark suite for `gcsfs`. These benchmarks are designed to measure the performance of various I/O operations under different conditions. They are built using `pytest` and the `pytest-benchmark` plugin to provide detailed performance metrics for single-threaded, multi-threaded, and multi-process scenarios.
Member:

Style: commenting is harder with these long lines; I recommend sticking to an 80-character limit.

I would also test against concurrent.interpreters, since it is now available in Python 3.14, and also the free-threaded build if all the upstream dependencies support it.

Contributor Author:

concurrent.interpreters is something I will take up in a future PR.
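
For reference, a rough sketch of what a subinterpreter-based variant could look like on Python 3.14+ with concurrent.futures.InterpreterPoolExecutor; the read_chunk and parallel_read helpers are illustrative names, not part of this PR:

```python
# Rough sketch, assuming Python 3.14+: subinterpreter workers via
# concurrent.futures.InterpreterPoolExecutor. Untested against gcsfs.
from concurrent.futures import InterpreterPoolExecutor


def read_chunk(path: str, offset: int, length: int) -> int:
    # Each worker runs in its own subinterpreter, so import inside the task.
    import gcsfs

    fs = gcsfs.GCSFileSystem()
    with fs.open(path, "rb") as f:
        f.seek(offset)
        return len(f.read(length))


def parallel_read(path: str, size: int, chunk: int, workers: int = 8) -> int:
    offsets = list(range(0, size, chunk))
    with InterpreterPoolExecutor(max_workers=workers) as ex:
        results = ex.map(read_chunk, [path] * len(offsets), offsets,
                         [chunk] * len(offsets))
        return sum(results)  # total bytes read across all subinterpreters
```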

You can install them using pip:

```bash
pip install -r gcsfs/tests/perf/microbenchmarks/requirements.txt
```
Member:

Many projects now use tools like uv to specify dependencies and run commands in a more declarative way, rather than descriptive README text like this. Might be worth thinking about.

Contributor Author:

I have added the dependencies to the environment file as well, so a conda run works out of the box; for non-conda scenarios we will have to live with requirements.txt for now. This is something we can revisit in the future.

benchmark_plugin_installed = False


@pytest.fixture
Member:

Session scope?

Contributor Author:

We want this fixture to create new files on every test run, hence keeping it function-scoped. If we created files once per session, the results would include caching done by GCS, which we don't want here.
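
Something like the following function-scoped sketch; the fixture name, bucket env var, and sizes are illustrative, not the exact conftest.py code:

```python
# Illustrative function-scoped fixture: a fresh object per test run, so
# repeated runs cannot hit GCS-side caching of a previously read object.
import os
import uuid

import pytest

import gcsfs

BUCKET = os.environ.get("GCSFS_BENCH_BUCKET", "my-bench-bucket")  # assumed name


@pytest.fixture  # function scope is the default
def gcs_temp_file():
    fs = gcsfs.GCSFileSystem()
    path = f"{BUCKET}/bench-{uuid.uuid4().hex}.bin"
    with fs.open(path, "wb") as f:
        f.write(os.urandom(16 * 2**20))  # 16 MiB of fresh random bytes
    yield path
    fs.rm(path)  # tear down so the bucket stays clean between tests
```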

def wrapper():
    base_cases = base_cases_func()
    new_cases = []
    for case in base_cases:
Member:

Can't this kind of thing be done with parametrize?

Contributor Author:

The configuration has now moved to YAML, so this is no longer the case.
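
A minimal sketch of that YAML-to-parameters pattern; the config.yaml name and schema below are assumptions, not the PR's exact layout:

```python
# Sketch: expand a YAML config into pytest parameters in conftest.py.
import pathlib

import yaml

CONFIG = pathlib.Path(__file__).parent / "config.yaml"  # assumed file name


def pytest_generate_tests(metafunc):
    # Any test requesting a `read_case` fixture gets one run per YAML entry.
    if "read_case" in metafunc.fixturenames:
        cases = yaml.safe_load(CONFIG.read_text())["read_cases"]
        ids = [f"{c['pattern']}-{c['file_size']}-{c['threads']}t" for c in cases]
        metafunc.parametrize("read_case", cases, ids=ids)
```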

return wrapper


def _get_bucket_name_for_type(bucket_type: str) -> str:
Member:

This is just a dictionary :)

Contributor Author:

Moved to a map in conftest.py so it can also be reused by other tests in the future.
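
That is, something along these lines (the map and env-var names are illustrative):

```python
import os

# Assumed bucket-type -> bucket-name map in conftest.py, filled from env vars.
BUCKETS_BY_TYPE = {
    "regional": os.environ.get("GCSFS_BENCH_REGIONAL_BUCKET", ""),
    "zonal": os.environ.get("GCSFS_BENCH_ZONAL_BUCKET", ""),
}


def bucket_for(bucket_type: str) -> str:
    return BUCKETS_BY_TYPE[bucket_type]
```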

return wrapper


def with_threads(base_cases_func: Callable) -> Callable:
Member:

This also looks like code duplication; these generator functions could be factored out.

Contributor Author:

The generators are now removed; the configuration moved to a YAML file whose values are converted into parameters.

Comment on lines 21 to 24
@with_bucket_types(["regional", "zonal"])
@with_file_sizes
@with_threads
@with_processes
Member:

Perhaps it doesn't matter, since this will only get written once, but decorators are less easy to understand than simple sequential code. This ends up working like parametrize - which is fine and a well-established pattern. But I wonder if it could have been written more simply within the function.

Contributor Author:

Same as above: removed the decorators and moved to a YAML-based config.

@@ -0,0 +1,7 @@
# Dependencies for running the performance benchmarks
gcsfs
Member:

Should this refer to the local directory explicitly? We are assuming the user has already installed gcsfs from source.

Contributor Author:

Done, thanks for the feedback.

@ankitaluthra1 (Collaborator):

/gcbrun

@ankitaluthra1 (Collaborator):

/gcbrun

* Fixing Block size and consistency options in Extended GCSFS Open (#34)

* Separate versioned and non-versioned tests to use different bucket

* Update cleanup logic in tests to empty the bucket instead of deleting the bucket

* Add consistency and blocksize for ext gcsfs

* Merge conflicts resolution

* bucket type paramet for fs test for block size and consistency

* removed unused mocks variables from some tests

* fixing lint errors

* fixed small issue with core tests so it can run with exp flag true

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>

* pytest microbenchmarks for seq and random reads both single multi threaded

* multiprocess benchmarks

* script to run tests

* undo settings for bucket names and logs

* benchmark script updates

* file size and bucket type decorators

* file size configuration

* removed zonal config

* Added README

* Readme update

* Moving settings and fixture to tests root

* Readme update

* Readme update

* Ignore benchmark pytests in CI

* benchmark hook fix

* adding skip tests flag

* benchmark plugin conditional enablement

* Fixing PR Comments, simplifying the configuration by doing auto gen

* Fixing PR Comments

* default settings

* added resource monitoring for benchmarks

* minor refactoring

* moved config to yaml

* lint fixes

* config yaml update for full read suite

* undo zonal file logging changes

* simplify single threaded read

* bringing back requirements

* update readme

* csv generation fix when some tests fail

* psutil install in cloudbuild

* psutil install in cloudbuild

* Removing zonal conditional code and updating config

* parallel file creation in setup

* merge conflicts

* merge conflicts - lint fixes

* lint issues

---------

Co-authored-by: Mahalaxmibejugam <60227368+Mahalaxmibejugam@users.noreply.github.com>
@ankitaluthra1 (Collaborator):

/gcbrun

@ankitaluthra1 (Collaborator):

/gcbrun

@ankitaluthra1 (Collaborator):

/gcbrun
